Abstract

Despite the long list of publications on parser combinators, there does
not yet exist a monadic parser combinator library that is applicable in
real-world situations. In particular, naive implementations of parser
combinators are likely to suffer from space leaks and are often unable to
report precise error messages in case of parse errors. The Parsec parser
combinator library described in this paper utilizes a novel implementation
technique for space- and time-efficient parser combinators that, in case of
a parse error, report both the position of the error and all grammar
productions that would have been legal at that point in the input.



1 Introduction 

Parser combinators have always been a favorite topic amongst functional
programmers. Burge (1975) already described a set of combinators, and they
have been studied extensively over the years by many others (Wad, 1985; Hut,
1992; Fok, 1995; HM, 1996). In contrast to parser generators that offer a fixed
set of combinators to express grammars, these combinators are manipulated as
first-class values and can be combined to define new combinators that fit the
application domain. Another advantage is that the programmer uses only one
language, avoiding the integration of different tools and languages (Hug, 1989).

Despite the theoretical benefits that parser combinators offer, they are hardly
used in practice. When we wrote a parser for the language XMλ (SM, 2001),
for example, we had a set of real-world requirements on the combinators. They
had to be monadic in order to make the parse context-sensitive, they had to be
efficient (i.e. competitive in speed with happy (GM, 1995) and without space
leaks), and they had to return high-quality error messages. To our surprise,
most current monadic parser libraries suffered from shortcomings that made
them unsuitable for our purposes: they are not efficient in space or time, and
they do not allow for good error messages.

There has been quite a lot of research on the efficiency of parser combinators
(KP, 1999; PW, 1996; Röj, 1995; Mei, 1992), but those libraries pay almost no
attention to error messages. Recently, Swierstra et al. (1996; 1999) have
developed sophisticated combinators that even perform error correction, but
unfortunately they use a non-monadic formulation and a separate lexer.

This paper describes the implementation of a set of monadic parser combinators
that are efficient and produce good quality error messages. Our main
contribution is the overall design of the combinators, more specifically:



• We describe a novel implementation technique for space- and time-efficient
parser combinators. Laziness is an essential ingredient in the short and
concise implementation. We identify a space leak that contributes largely to
the inefficiency of many existing parser combinators described in the
literature.



• We show how the primitive combinators can be extended naturally with
error messages. The user can label grammar productions with suitable
names. The messages contain not only the position of the error but also
all grammar productions that would have been legal at that point in the
input, i.e. the first-set of that production.



The combinators that are described in this paper have been used to implement
a 'real world' parser library in Haskell that is called parsec. This library
is available with documentation and examples from
http://www.cs.uu.nl/~daan/parsec.html and is distributed with the GHC compiler.

Throughout the rest of the paper we assume that the reader is familiar with 
the basics of monadic combinator parsers. The interested reader is referred to 
Hutton and Meijer (1996) for a tutorial introduction. 



2 Grammars and Parsers 

The following sections discuss several important restrictions and other
characteristics of existing parser combinator libraries that influenced the
design of Parsec.



2.1 Monadic vs. Arrow style Parsers 

Monadic combinator parsers consist of a monad Parser a (typically of the form
String -> Result a for some functor Result) with a unit return and bind
(>>=) operation, and a number of parser-specific operations, usually a choice
combinator (<|>) and a function satisfy for constructing elementary parsers
for terminal symbols:

type Parser a

return  :: a -> Parser a
(>>=)   :: Parser a -> (a -> Parser b) -> Parser b

satisfy :: (Char -> Bool) -> Parser Char
(<|>)   :: Parser a -> Parser a -> Parser a

An important practical benefit of monadic parser combinators is the fact that 
Haskell has special syntax (the do-notation) that greatly simplifies writing monadic 
programs. However, there are also deeper reasons why we prefer using monadic 
combinators. 

Besides bind, there exists another important form of a sequential combinator
(<*>), which is described by Swierstra and Duponcheel (1996) and later
identified as a special case of an arrow-style combinator by Hughes (2000). The
types of the monadic and arrow-style combinators are closely related, but
subtly different:

(<*>) :: Parser a -> Parser (a -> b) -> Parser b
(>>=) :: Parser a -> (a -> Parser b) -> Parser b

However, their runtime behavior differs as much as their types are similar. Due
to parametricity (Wad, 1989), the second parser of (<*>) will never depend on
the (runtime) result of the first. In the monadic combinator, the second parser
always depends on the result of the first parser. An interesting relation between
both forms follows directly from their type signatures: arrow-style parser
combinators can at most parse languages that can be described by a context-free
grammar, while the monadic combinators can also parse languages described by
context-sensitive grammars.

Since parsers described with arrow-style combinators never depend on run-time 
constructed values, it is possible to analyze the parsers before executing them. 
Swierstra and Duponcheel (1996) use this characteristic when they describe 
combinators that build lookup tables and perform dynamic error correction. 

Monadic combinators are able to parse context-sensitive grammars. This is not
just a technical nicety. It can be used in many situations that are traditionally
handled as a separate pass after parsing. For example, if plain XML documents
are parsed with a context-free parser, a separate analysis is needed to
guarantee well-formedness, i.e. that every start tag is closed by a matching end
tag.

A monadic parser can construct a specialized end tag parser when an open tag 
is encountered. Given an openTag parser that returns the tag name of a tag 
and an endTag parser that parses an end tag with the name that is passed as 
an argument, an XML parser that only accepts well-formed fragments can be 
structured as follows: 

xml = do{ name <- openTag
        ; content <- many xml
        ; endTag name
        ; return (Node name content)
        }
  <|> xmlText
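
As a rough illustration of this context-sensitive step, the following sketch runs the same structure with the ReadP combinators from GHC's base library, which stand in here for the Parsec combinators developed later in the paper. The Xml type, the restriction of tag names to letters, and the helper parseXml are assumptions made only for this example.

```haskell
import Data.Char (isAlpha)
import Text.ParserCombinators.ReadP

-- Hypothetical document type for the illustration.
data Xml = Node String [Xml] | Text String
  deriving (Eq, Show)

-- Parses "<name>" and returns the tag name (letters only, by assumption).
openTag :: ReadP String
openTag = do
  _    <- char '<'
  name <- munch1 isAlpha
  _    <- char '>'
  return name

-- The end-tag parser is built at run time from the name that was just parsed.
endTag :: String -> ReadP ()
endTag name = do
  _ <- string ("</" ++ name ++ ">")
  return ()

xmlText :: ReadP Xml
xmlText = Text <$> munch1 (/= '<')

xml :: ReadP Xml
xml = node <++ xmlText
  where
    node = do
      name    <- openTag
      content <- many xml
      endTag name
      return (Node name content)

-- Succeeds only if the whole input is one well-formed fragment.
parseXml :: String -> Maybe Xml
parseXml input =
  case readP_to_S (xml <* eof) input of
    ((x, _) : _) -> Just x
    []           -> Nothing
```

With these definitions, parseXml accepts "<a>hi<b>x</b></a>" but rejects "<a>hi</b>", because the end-tag parser depends on the name returned by openTag.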



2.2 Left recursion 

An important restriction on most existing combinator parsers (and Parsec is no 
exception) is that they are unable to deal with left-recursion. The first thing a 
left-recursive parser would do is to call itself, resulting in an infinite loop. 

In practice however, grammars are often left-recursive. For example, expression 
grammars usually use left-recursion to describe left-associative operators. 

expr   ::= expr "+" factor
factor ::= number | "(" expr ")"

As it is, this grammar cannot be literally translated into parser combinators.
Fortunately, every left-recursive grammar can be rewritten into a right-recursive
one (ASU, 1986). It is also possible to define a combinator chainl (Fok, 1995)
that captures the design pattern of encoding associativity using left-recursion
directly, thereby avoiding a manual rewrite of the grammar.
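
A minimal sketch of the chainl idea, using chainl1 from the ReadP module in GHC's base library rather than Parsec itself. The evaluator evalExpr and the use of "-" instead of "+" (so that the associativity is observable in the result) are assumptions made for this example.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

number :: ReadP Int
number = read <$> munch1 isDigit

factor :: ReadP Int
factor = number <++ between (char '(') (char ')') expr

-- chainl1 parses "factor (op factor)*" and folds the results to the left,
-- so no left-recursive call to expr is needed.
expr :: ReadP Int
expr = chainl1 factor (char '-' >> return (-))

-- Evaluates the whole input, or returns Nothing on a parse error.
evalExpr :: String -> Maybe Int
evalExpr input =
  case readP_to_S (expr <* eof) input of
    ((v, _) : _) -> Just v
    []           -> Nothing
```

Here evalExpr "10-3-2" yields Just 5, the left-associative reading (10-3)-2, while explicit parentheses as in "10-(3-2)" still give Just 9.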



2.2.1 Sharing 

One could think that the combinators themselves can observe that expr is left- 
recursive, and thus could prevent going into an infinite loop. In a pure language 
however, it is impossible to observe sharing from within the program. It follows 
that parser combinators are unable to analyze their own structure and can never 
employ standard algorithms on grammars to optimize the parsing process. 

All combinator libraries are forced to use a predictive parsing algorithm, also known
as left-to-right, left-most derivation or LL parsing (ASU, 1986). (LR parsing 
is still the exclusive domain of separate tools that can analyze the grammar 
on a meta-level.) However, Claessen and Sands (1999) describe an interesting 
approach to observable sharing in the context of hardware descriptions which 
might be used in the context of parser combinators to analyze the structure of 
a parser at run-time. 



2.3 Backtracking 

Ambiguous grammars have more than one parse tree for a sentence in the
language. Only parser combinators that can return more than one value can handle
ambiguous grammars. Such combinators use a list as their reply type.

In practice however, you hardly ever need to deal with ambiguous grammars. In 
fact it is often more a nuisance than a help. For instance, for parser combinators 
that return a list of successes, it doesn't matter whether that list contains zero, 
one or many elements. They are all valid answers. This makes it hard to give
good error messages (see below) . Furthermore it is non-trivial to tame the space 
and (worst case exponential) time complexity of full backtracking parsers. 

However, even when we restrict ourselves to non-ambiguous grammars, we still 
need to backtrack because the parser might need to look arbitrarily far ahead in
the input. 

Naive implementations of backtracking parser combinators suffer from a space
leak. The problem originates in the definition of the choice combinator. It either
always tries its second alternative (because it tries to find all possible parses),
or tries it whenever the first alternative fails (because it requires arbitrary
lookahead). As a result, the parser (p <|> q) holds on to its input until p
returns, since it needs the original input to run parser q when p has failed.
The space leak quickly leads to either a stack/heap overflow or a reduction in
speed on larger inputs.



2.4 Errors 

Parsers should report errors when the input does not conform to the grammar. 
A good parser error message contains the position of the error in the input as 
well as the cause of the error. Besides the cause of an error, the message can 
contain all possible productions that would have been legal at that point in the 
input. These correspond to the first set of a non-terminal. 

Besides error reporting, the parser might try to correct the error. After detecting
an error, the input is modified by deleting or inserting tokens which might lead 
to valid input again. Swierstra and Duponcheel (1996) describe how automatic 
error correction can be implemented with arrow-style parser combinators. 

As explained above, current (nondeterministic) parser combinators are not very
good at reporting errors. The combinators report neither the position nor the
possible causes of an error. It is hard to report an error since the parsers can
always look arbitrarily far ahead in the input (they are LL(∞)), and it becomes
hard to decide what the error message should be.

It is for the two reasons above that in Parsec we restrict ourselves to predictive
parsers with limited lookahead. The <|> combinator is left-biased and will return
the first succeeding parse tree (i.e. even if the grammar is ambiguous only
one parse tree is returned). The Parsec combinators will report all possible
causes of an error. The messages can be customized by the user: instead of
reporting the error at the character level, a message can contain a description
of a grammar production.



2.5 LL Grammars 

The following sections derive space-efficient and error-reporting combinator
parsers. The space leak can be fixed by restricting the lookahead. As a side
effect this also improves the quality of the error messages that are implemented
later in this paper.

LL grammars have the distinctive properties that they are non-ambiguous and
not left-recursive. A grammar is LL(k) if the associated predictive parser needs
at most k tokens of lookahead to disambiguate a parse tree. For example, the
following grammar is LL(2):

S ::= PQ | Q
P ::= "p"
Q ::= "pq"

When the first token is "p", we still don't know whether we are in the PQ or
the Q production; only upon seeing the second token ("p" or "q") do we know
which to choose.

The usual list-of-successes combinators have the interesting property that they
have a dynamic lookahead to an arbitrarily large k; we will call this an LL(∞)
parser. The combinators will look arbitrarily far ahead due to the definition
of the (<|>) combinator. Whenever the first parser fails, the second will be
tried instead, no matter how many tokens the first parser has consumed! The
previous grammar can be translated literally into combinators:

s = do{ p; q } <|> q
p = char 'p'
q = do{ char 'p'; char 'q' }
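
To make the arbitrary-lookahead behavior concrete, here is a small sketch of the same grammar using ReadP from GHC's base library as a stand-in for a backtracking library; its symmetric choice (+++) explores both alternatives. The helper accepts is an assumption made for this example.

```haskell
import Text.ParserCombinators.ReadP

p :: ReadP Char
p = char 'p'

q :: ReadP Char
q = char 'p' >> char 'q'

-- Literal translation of S ::= PQ | Q with a backtracking choice.
s :: ReadP Char
s = (p >> q) +++ q

-- True when the whole input is a sentence of the grammar.
accepts :: String -> Bool
accepts input = not (null (readP_to_S (s <* eof) input))
```

Both "ppq" and "pq" are accepted, even though the first alternative has already consumed a "p" by the time it fails on "pq".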

Unfortunately, this doesn't hold in general. There is a specific case where we
can't literally translate the grammar. Here is the previous grammar again,
written in a slightly different way:

S ::= PQ
P ::= "p" | ε
Q ::= "pq"

When we literally translate this grammar we get:

s = do{ p; q }
p = char 'p' <|> return 'p'
q = do{ char 'p'; char 'q' }

The <|> combinator is now local to the p parser. It returns a result right after
the first character is consumed. If the input was "pq" it will recognize the
"p" character as part of the p production and fail when trying q! The <|>
combinator should be used at the point where lookahead is actually needed and
cannot be used locally in the production.

In general, every PQ where P ⇒* ε (i.e. P has an empty derivation) and where
first(P) ∩ first(Q) ≠ ∅ (i.e. their first-sets overlap (ASU, 1986)) should be
rewritten to P'Q | Q, where P' equals production P but no longer includes an
ε derivation. If a grammar is left-factored (ASU, 1986) this transformation
happens automatically.

LL(∞) is a powerful grammar class. Any non-ambiguous context-free grammar
can be transformed into an LL(∞) grammar. In practice, there are many
languages that require arbitrary lookahead; for example, type signatures in
Haskell or declarations in C.



3 Restricting lookahead 

The following sections will focus on implementing a set of monadic combinators
that circumvent the space leak of naive combinators and add good error
messages.

To solve the space leak of the naive parser combinators, we turn to deterministic 
predictive parsing with limited lookahead. An LL(1) parser has a lookahead of a 
single token - it can always decide which alternative to take based on the current 
input character. In practice this means that the parser (p <|> q) never tries
parser q whenever parser p has consumed any input. 

To use an LL(1) strategy, each parser keeps track of its input consumption.
We call this the consumer-based approach. A parser has either Consumed input
or returned a value without consuming input, Empty. The return value is either
a single result and the remaining input, Ok a String, or a parse error, Error:

type Parser a   = String -> Consumed a

data Consumed a = Consumed (Reply a)
                | Empty (Reply a)

data Reply a    = Ok a String | Error

Note that the real Parsec library is parameterized with the type of the input 
and a user definable state. 



3.1 Basic combinators 

Given the concrete definition of our Parser type, we can now turn to the
implementation of the basic parser combinators.

Figure 1: Input consumption of (>>=)



The return combinator succeeds immediately without consuming any input, 
hence it returns the Empty alternative: 

return x
  = \input -> Empty (Ok x input)

The satisfy combinator consumes a single character when the test succeeds 
but returns Empty when the test fails, or when it encounters the end of the 
input: 

satisfy :: (Char -> Bool) -> Parser Char
satisfy test
  = \input -> case (input) of
      []                 -> Empty Error
      (c:cs) | test c    -> Consumed (Ok c cs)
             | otherwise -> Empty Error

With the satisfy combinator we can already define some useful parsers: 

char c = satisfy (==c) 
letter = satisfy isAlpha 
digit = satisfy isDigit 

The implementation of the (>>=) combinator is the first one where we take
consumer information into account. Figure 1 summarizes the input consumption
of a parser (p >>= f). If p succeeds without consuming input, the result is
determined by the second parser. However, if p succeeds while consuming input,
the sequence starting with p surely consumes input. Thanks to lazy evaluation,
it is therefore possible to immediately build a reply with a Consumed constructor
even though the final reply value is unknown.

(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f
  = \input -> case (p input) of
      Empty reply1
        -> case (reply1) of
             Ok x rest -> ((f x) rest)
             Error     -> Empty Error

      Consumed reply1
        -> Consumed
           (case (reply1) of
              Ok x rest
                -> case ((f x) rest) of
                     Consumed reply2 -> reply2
                     Empty reply2    -> reply2
              Error -> Error
           )



Due to laziness, a parser (p >>= f) directly returns with a Consumed
constructor if p consumes input. The computation of the final reply value is
delayed. This 'early' returning is essential for the efficient behavior of the
choice combinator.

An LL(1) choice combinator only looks at its second alternative if the first hasn't
consumed any input, regardless of the final reply value! Now that the (>>=)
combinator immediately returns a Consumed constructor as soon as some input
has been consumed, the choice combinator can choose an alternative as soon
as some input has been consumed. It no longer holds on to the original input,
fixing the space leak of the previous combinators.



(<|>) :: Parser a -> Parser a -> Parser a
p <|> q
  = \input -> case (p input) of
      Empty Error -> (q input)
      Empty ok    -> case (q input) of
                       Empty _  -> Empty ok
                       consumed -> consumed
      consumed    -> consumed



Note that if p succeeds without consuming input, the second alternative is
favored if it consumes input. This implements the "longest match" rule.
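
The definitions of this section can be assembled into a small runnable module, sketched below. One deviation is an assumption made for this example: Parser is wrapped in a newtype so that it can be given a Monad instance and used with do-notation. The helpers hasConsumed and parse are likewise hypothetical and only serve to observe the behavior.

```haskell
import Data.Char (isAlpha, isDigit)

data Reply a    = Ok a String | Error
data Consumed a = Consumed (Reply a) | Empty (Reply a)

-- A newtype instead of the type synonym in the text (assumption).
newtype Parser a = Parser { runParser :: String -> Consumed a }

instance Functor Parser where
  fmap f p = p >>= (pure . f)

instance Applicative Parser where
  pure x    = Parser (\input -> Empty (Ok x input))
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad Parser where
  p >>= f = Parser $ \input ->
    case runParser p input of
      Empty (Ok x rest) -> runParser (f x) rest
      Empty Error       -> Empty Error
      -- The Consumed constructor is returned before the reply is evaluated.
      Consumed reply1   -> Consumed $
        case reply1 of
          Ok x rest -> case runParser (f x) rest of
                         Consumed reply2 -> reply2
                         Empty reply2    -> reply2
          Error     -> Error

infixr 1 <|>
(<|>) :: Parser a -> Parser a -> Parser a
p <|> q = Parser $ \input ->
  case runParser p input of
    Empty Error -> runParser q input
    Empty ok    -> case runParser q input of
                     Empty _  -> Empty ok
                     consumed -> consumed
    consumed    -> consumed

satisfy :: (Char -> Bool) -> Parser Char
satisfy test = Parser $ \input ->
  case input of
    (c:cs) | test c -> Consumed (Ok c cs)
    _               -> Empty Error

char :: Char -> Parser Char
char c = satisfy (== c)

many1 :: Parser a -> Parser [a]
many1 p = do
  x  <- p
  xs <- many1 p <|> return []
  return (x:xs)

identifier :: Parser String
identifier = many1 (satisfy isAlpha <|> satisfy isDigit <|> char '_')

-- True as soon as the parser is known to have consumed input.
hasConsumed :: Parser a -> String -> Bool
hasConsumed p input =
  case runParser p input of
    Consumed _ -> True
    Empty _    -> False

-- Collapses the reply into a result and the remaining input.
parse :: Parser a -> String -> Maybe (a, String)
parse p input =
  case runParser p input of
    Consumed (Ok x rest) -> Just (x, rest)
    Empty (Ok x rest)    -> Just (x, rest)
    _                    -> Nothing
```

In this sketch, hasConsumed (char 'a' >> undefined) "abc" evaluates to True without ever running the second parser, which is the 'early' return that the choice combinator relies on. Likewise, (char 'a' >> char 'b') <|> (char 'a' >> char 'c') fails on "ac", because the first alternative has already consumed the 'a'.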

With the bind and choice combinators we can define almost any parser. Here
are a few useful examples:

string :: String -> Parser ()
string ""     = return ()
string (c:cs) = do{ char c; string cs }

many1 :: Parser a -> Parser [a]
many1 p
  = do{ x <- p
      ; xs <- (many1 p <|> return [])
      ; return (x:xs)
      }

identifier
  = many1 (letter <|> digit <|> char '_')

Note that the formulation of the many1 parser works because the choice
combinator doesn't backtrack anymore.



3.2 Related work 

It is interesting to compare this approach with previous work on efficient parser
combinators. Röjemo (1995) uses a continuation based approach in combination
with a cut combinator. The cut combinator is used to implement an LL(1)
variant of the choice combinator. A variant of Röjemo's solution is given by
Koopman and Plasmeijer (1999). In his thesis (1992), Meijer describes several
alternative implementations of the cut combinator using continuation based
parsers.

The main contribution of this paper is the simplicity of the consumer based 
approach when compared to an implementation based on continuations. Due 
to laziness, the algorithm can be specified declaratively, while getting the same 
operational 'interleaved' behavior as with continuations. It is also easier to 
constructively add error messages to the combinators, which is done later in 
this paper. 

The consumer based design is perhaps most closely related to the work of
Partridge and Wright (1996). They implement a predictive LL(1) parser using four
return values in their parser reply:

data Reply a = Ok a String
             | Epsn a String
             | Err
             | Fail

Figure 2: Comparison of libraries



The Epsn (epsilon) and Fail alternatives are used when the parser hasn't
consumed any input. The correspondence with a consumer based design is clear:

Partridge & Wright      Consumer based design
Ok x input          =   Consumed (Ok x input)
Epsn x input        =   Empty (Ok x input)
Err                 =   Consumed (Error)
Fail                =   Empty (Error)



Unfortunately, the approach of Partridge and Wright still suffers from the space 
leak. The information about input consumption is tupled with the information 
about the success of the parser. The choice operator now holds on to the input
since information about both the success and the consumption of a parser is 
returned, which can only be done after a reply is completely evaluated. 



3.3 Measurements 

We have done some preliminary measurements on the effectiveness of the consumer
based design. We took four different libraries and let them parse the standard
libraries of the Zurich Oberon system (Wir, 1988). To make the test as honest
as possible, we wrote the Oberon parser using standard arrow-style combinators
and mapped the basic combinators of each library to these combinators. This
enables each library to use exactly the same parser sources.

The libraries tested are: 

• parsec. The full Parsec library, including the error message mechanism
that is developed later in this paper. The library can parse
context-sensitive grammars with infinite lookahead. There are two variants tested:
"parsec" is a version where the entire grammar, including the lexical part,
is described using parser combinators, and "parsec+scanner" is a version
where a separate hand-written scanner is used.

• uuparsing. A sophisticated arrow-style library developed at the University
of Utrecht (SAA, 1999). A prominent feature is that the library
automatically corrects the input on errors and (thus) always succeeds. The library
parses context-free grammars with infinite lookahead. The parser in our test
uses a separate hand-written scanner for Oberon.

• parselib. The 'standard' monadic parser library that is distributed with the
Hugs interpreter. This is a monadic parser library developed by Graham
Hutton and Erik Meijer (HM, 1996). The library parses context-sensitive
grammars with infinite lookahead and can even deal with ambiguous grammars, but
gives no error messages at all. The entire grammar is described using
parser combinators.

Each library was compiled with GHC 5.02 with the -O2 flag and tested against
all 102 standard library files of the Zurich Oberon system. The largest of these
files consists of 115,000 characters and 3302 lines, and the total line count is
87,000. The libraries were run with a 64 Mb heap on a 550 MHz Pentium
running FreeBSD. Detailed results can be found at
http://www.cs.uu.nl/~daan/pbench.html.

Figure 2 summarizes the results. It shows the average number of characters 
parsed per second, the number of bytes allocated per character and the number 
of bytes resident per character. The residency gives the maximal portion of the 
heap that was live during the execution of the program. 

The measurements should be interpreted with care since each library uses
different parsing strategies and has different features. For example, in contrast
to the other libraries, the ParseLib library can deal with ambiguous grammars.
The bottom line, however, is that each library uses exactly the same parser
source to parse the same Oberon sources, and it seems that the consumer based
design pays off in practice.



3.4 Infinite lookahead, again 

With all these optimization efforts, the parser combinators are now restricted 
to LL(1) grammars. Unfortunately, most (programming language) grammars 
are not LL(1) and even require arbitrary lookahead. 

Dually to the approach sketched in (Röj, 1995; HM, 1996; KP, 1999; Mei, 1992),
where a special combinator is introduced to mark explicitly when no lookahead
is needed, we add a special combinator to mark explicitly where arbitrary
lookahead is allowed.

The (try p) parser behaves exactly like parser p but pretends it hasn't
consumed any input when p fails:

try :: Parser a -> Parser a
try p
  = \input -> case (p input) of
      Consumed Error -> Empty Error
      other          -> other

Consider the parser (try p <|> q). Even when parser p fails while consuming
input (Consumed Error), the choice operator will try the alternative q since the
try combinator has changed the Consumed constructor into Empty. Indeed, if
you put try around all parsers you will have an LL(∞) parser again!

Although not discussed in their paper, the try combinator could just as easily
be applied with the four reply value approach of Partridge and Wright (1996),
changing Err replies into Fail replies. The approach sketched here is dual
to the three reply values of Hutton (1992). Hutton introduces a noFail
combinator that turns empty errors into consumed errors! It effectively prevents
backtracking by manual intervention.
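
The effect of try on the choice combinator can be observed with a short self-contained sketch. As before, the newtype wrapper and the helper parse are assumptions made for this example, as are the two test parsers withoutTry and withTry.

```haskell
data Reply a    = Ok a String | Error
data Consumed a = Consumed (Reply a) | Empty (Reply a)

-- A newtype instead of the type synonym in the text (assumption).
newtype Parser a = Parser { runParser :: String -> Consumed a }

instance Functor Parser where
  fmap f p = p >>= (pure . f)

instance Applicative Parser where
  pure x    = Parser (\input -> Empty (Ok x input))
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad Parser where
  p >>= f = Parser $ \input ->
    case runParser p input of
      Empty (Ok x rest) -> runParser (f x) rest
      Empty Error       -> Empty Error
      Consumed reply1   -> Consumed $
        case reply1 of
          Ok x rest -> case runParser (f x) rest of
                         Consumed reply2 -> reply2
                         Empty reply2    -> reply2
          Error     -> Error

infixr 1 <|>
(<|>) :: Parser a -> Parser a -> Parser a
p <|> q = Parser $ \input ->
  case runParser p input of
    Empty Error -> runParser q input
    Empty ok    -> case runParser q input of
                     Empty _  -> Empty ok
                     consumed -> consumed
    consumed    -> consumed

char :: Char -> Parser Char
char c = Parser $ \input ->
  case input of
    (x:xs) | x == c -> Consumed (Ok c xs)
    _               -> Empty Error

string :: String -> Parser ()
string ""     = return ()
string (c:cs) = char c >> string cs

-- Turns a failure after consuming input into a failure without consumption.
try :: Parser a -> Parser a
try p = Parser $ \input ->
  case runParser p input of
    Consumed Error -> Empty Error
    other          -> other

parse :: Parser a -> String -> Maybe (a, String)
parse p input =
  case runParser p input of
    Consumed (Ok x rest) -> Just (x, rest)
    Empty (Ok x rest)    -> Just (x, rest)
    _                    -> Nothing

-- Commits to "ab" after reading 'a', so the second alternative is never tried.
withoutTry :: Parser ()
withoutTry = string "ab" <|> string "ac"

-- try lets the choice fall through to "ac" when "ab" fails halfway.
withTry :: Parser ()
withTry = try (string "ab") <|> string "ac"
```

On the input "ac", withoutTry fails while withTry succeeds; on "ab" both succeed.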



3.5 Lexing 

The try combinator can for example be used to specify both a lexer and parser 
together. Take for example the following parser: 

expr = do{ string "let"; whiteSpace; letExpr }
   <|> identifier

As it stands, this parser doesn't work as expected. On the input letter for 
example, it fails with an error message. 

>run expr "letter"
parse error at (line 1, column 4):
unexpected "t"
expecting white space

The try combinator should be used to backtrack on the let keyword. The 
following parser correctly recognises the input letter as an identifier. 

expr = do{ try (string "let"); whiteSpace; letExpr }
   <|> identifier



In contrast with other libraries, the try combinator is not built into a special
choice combinator. This improves modularity and allows the construction of
lexer libraries that use try on each lexical token. The Parsec library is
distributed with such a library and in practice, try is only needed for grammar
constructions that require lookahead.



4 Error Messages 

The restriction to LL(1) makes it much easier for us to generate good error 
messages. First of all, the error message should include the position of an error. 
The parser input is tupled with the current position - the parser state. 

type Parser a = State -> Consumed a 
data State = State String Pos 

Besides the position, it is very helpful for the user to return the grammar
productions that would have led to correct input at that position. This
corresponds to the FIRST set of that production. During the parsing process, we
will dynamically compute first sets for use in error messages. This may seem
expensive, but laziness ensures that this only happens when an actual error occurs.

An error message contains a position, the unexpected input and a list of expected 
productions - the first set. 

data Message = Msg Pos String [String]

To dynamically compute the first set, not only Error replies but also Ok replies 
should carry an error message. Within the Ok reply, the message represents the 
error that would have occurred if this successful alternative wasn't taken. 

data Reply a = Ok a State Message
             | Error Message



4.1 Basic parsers 

The return parser attaches an empty message to the parser reply.

return :: a -> Parser a
return x
  = \state@(State _ pos) ->
      Empty (Ok x state (Msg pos [] []))

The implementation of the satisfy parser changes more. It updates the parse 
position if it succeeds and returns an error message with the current position 
and input if it fails. 

satisfy :: (Char -> Bool) -> Parser Char
satisfy test
  = \(State input pos) ->
      case (input) of
        (c:cs) | test c
          -> let newPos   = nextPos pos c
                 newState = State cs newPos
             in seq newPos
                  (Consumed
                    (Ok c newState
                      (Msg pos [] [])))
        (c:cs) -> Empty (Error
                          (Msg pos [c] []))
        []     -> Empty (Error
                          (Msg pos "end of input" []))

Note the use of seq to strictly evaluate the new position. If this is done lazily, 
we would introduce a new space leak - the original input is retained since it is 
needed to compute the new position at some later time. 

The (<|>) combinator computes the dynamic first set by merging the error
messages of two Empty alternatives - regardless of their reply value. Whenever 
both alternatives do not consume input, both of them contribute to the possible 
causes of failure. Even when the second succeeds, the first alternative should 
propagate its error messages into the Ok reply. 

(<|>) :: Parser a -> Parser a -> Parser a
p <|> q
  = \state ->
      case (p state) of
        Empty (Error msg1)
          -> case (q state) of
               Empty (Error msg2)
                 -> mergeError msg1 msg2
               Empty (Ok x inp msg2)
                 -> mergeOk x inp msg1 msg2
               consumed
                 -> consumed
        Empty (Ok x inp msg1)
          -> case (q state) of
               Empty (Error msg2)
                 -> mergeOk x inp msg1 msg2
               Empty (Ok _ _ msg2)
                 -> mergeOk x inp msg1 msg2
               consumed
                 -> consumed
        consumed -> consumed

mergeOk x inp msg1 msg2
  = Empty (Ok x inp (merge msg1 msg2))

mergeError msg1 msg2
  = Empty (Error (merge msg1 msg2))

merge (Msg pos inp exp1) (Msg _ _ exp2)
  = Msg pos inp (exp1 ++ exp2)

Notice that the positions of the error message passed to merge should always be 
the same. Since the choice combinator only calls merge when both alternatives 
have not consumed input, both positions are guaranteed to be equal. 

The sequence combinator computes the first set by merging error messages 
whenever one of the parsers doesn't consume input. In those cases, both of the 
parsers contribute to the error messages. 



4.2 Labels

Although error messages are nicely merged, there is still no way of adding names
to productions. The new combinator (<?>) labels a parser with a name.

The parser (p <?> msg) behaves like parser p, but when it fails without
consuming input, it sets the expected productions to msg. It is important that
it only does so when no input is consumed, since otherwise it wouldn't be
something that is expected after all:

(<?>) :: Parser a -> String -> Parser a
p <?> exp
  = \state ->
      case (p state) of
        Empty (Error msg)
          -> Empty (Error (expect msg exp))
        Empty (Ok x st msg)
          -> Empty (Ok x st (expect msg exp))
        other -> other

expect (Msg pos inp _) exp
  = Msg pos inp [exp]

The label combinator is used to return error messages in terms of high-level 
grammar productions rather than at the character level. For example, the 
elementary parsers are redefined with labels: 

digit  = satisfy isDigit <?> "digit"
letter = satisfy isAlpha <?> "letter"
char c = satisfy (==c)   <?> (show c)

identifier = many1 (letter <|> digit <|> char '_')



4.3 Labels in practice 

Error messages are quite improved with these labels, even when compared to 
widely used parser generators like YACC. Here is an example of applying the 
identifier parser to the empty input. 

>run identifier "" 
parse error at (line 1, column 1): 
unexpected end of input 
expecting letter, digit or '_' 

Normally all important grammar productions get a label attached. The previous
error message is even better when the identifier parser is also labeled. Note
that this label overrides the others.

>run identifier "@"
parse error at (line 1, column 1):
unexpected "@"
expecting identifier

The following example illustrates why Ok replies need to carry error messages
with them.

test = do{ (digit <|> return '0') 
; letter 
} 

The first set of this parser consists of both a digit and a letter. On illegal
input both these productions should be included in the error message.
Operationally, the digit parser will fail and the return '0' alternative is
taken. The Ok reply however still holds the expecting digit message. When the
letter parser fails, both productions are shown:

>run test "*"
parse error at (line 1, column 1):
unexpected "*"
expecting digit or letter
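
The following self-contained sketch reproduces this behavior. Several parts are assumptions made for the example rather than definitions from the paper: Pos is a plain character offset, Parser is a newtype so that do-notation and (>>) are available, the sequencing rule (with its helper mergeSeq) is one plausible reading of the description in section 4.1, and run returns the raw message fields instead of a formatted string.

```haskell
import Data.Char (isAlpha, isDigit)

type Pos = Int                        -- assumption: a plain character offset
data State      = State String Pos
data Message    = Msg Pos String [String]
data Reply a    = Ok a State Message | Error Message
data Consumed a = Consumed (Reply a) | Empty (Reply a)
newtype Parser a = Parser { runParser :: State -> Consumed a }

-- Merge for choice, as in the text: keep the first position and input.
merge :: Message -> Message -> Message
merge (Msg pos inp exp1) (Msg _ _ exp2) = Msg pos inp (exp1 ++ exp2)

-- Hypothetical merge for sequencing: position and input of the later
-- message, expectations of both.
mergeSeq :: Message -> Message -> Message
mergeSeq (Msg _ _ exp1) (Msg pos inp exp2) = Msg pos inp (exp1 ++ exp2)

instance Functor Parser where
  fmap f p = p >>= (pure . f)

instance Applicative Parser where
  pure x    = Parser (\st@(State _ pos) -> Empty (Ok x st (Msg pos [] [])))
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad Parser where
  p >>= f = Parser $ \st ->
    case runParser p st of
      Empty (Error msg)     -> Empty (Error msg)
      Empty (Ok x st1 msg1) ->
        case runParser (f x) st1 of
          Empty reply2 -> Empty (addMsg msg1 reply2)
          consumed     -> consumed
      Consumed reply1 -> Consumed $
        case reply1 of
          Error msg     -> Error msg
          Ok x st1 msg1 ->
            case runParser (f x) st1 of
              Empty reply2    -> addMsg msg1 reply2
              Consumed reply2 -> reply2
    where
      -- Carry the first parser's expectations into an Empty second reply.
      addMsg msg1 (Error msg2)    = Error (mergeSeq msg1 msg2)
      addMsg msg1 (Ok y st2 msg2) = Ok y st2 (mergeSeq msg1 msg2)

infixr 1 <|>
(<|>) :: Parser a -> Parser a -> Parser a
p <|> q = Parser $ \st ->
  case runParser p st of
    Empty (Error msg1) ->
      case runParser q st of
        Empty (Error msg2)    -> Empty (Error (merge msg1 msg2))
        Empty (Ok x inp msg2) -> Empty (Ok x inp (merge msg1 msg2))
        consumed              -> consumed
    Empty (Ok x inp msg1) ->
      case runParser q st of
        Empty (Error msg2)  -> Empty (Ok x inp (merge msg1 msg2))
        Empty (Ok _ _ msg2) -> Empty (Ok x inp (merge msg1 msg2))
        consumed            -> consumed
    consumed -> consumed

infix 0 <?>
(<?>) :: Parser a -> String -> Parser a
p <?> name = Parser $ \st ->
  case runParser p st of
    Empty (Error (Msg pos inp _))  -> Empty (Error (Msg pos inp [name]))
    Empty (Ok x s (Msg pos inp _)) -> Empty (Ok x s (Msg pos inp [name]))
    other                          -> other

satisfy :: (Char -> Bool) -> Parser Char
satisfy test = Parser $ \(State input pos) ->
  case input of
    (c:cs) | test c ->
      let newPos = pos + 1
      in  newPos `seq` Consumed (Ok c (State cs newPos) (Msg pos [] []))
    (c:_) -> Empty (Error (Msg pos [c] []))
    []    -> Empty (Error (Msg pos "end of input" []))

digit, letter :: Parser Char
digit  = satisfy isDigit <?> "digit"
letter = satisfy isAlpha <?> "letter"

-- The example parser from the text.
example :: Parser Char
example = (digit <|> return '0') >> letter

-- On failure: (position, unexpected input, expected productions).
run :: Parser a -> String -> Either (Pos, String, [String]) a
run p input =
  case runParser p (State input 0) of
    Consumed reply -> result reply
    Empty reply    -> result reply
  where
    result (Ok x _ _)                 = Right x
    result (Error (Msg pos inp exps)) = Left (pos, inp, exps)
```

Under these assumptions, run example "*" reports the unexpected "*" at offset 0 together with both expected productions, digit and letter, because the Ok reply of the choice still carries the message of the failed digit alternative.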



4.4 Related work 

Error reporting is first described by Hutton (1992). However, the solution
proposed is quite subtle to apply in practice, involving deep knowledge about the
underlying implementation. The position of the error is not reported as the
combinators backtrack by default, which makes it hard to generate good quality
error messages. Röjemo (1995) adds error messages with source positions using
a predictive parsing strategy.

Error correcting parsers are parsers that always continue parsing. Swierstra
et al. (1996; 1999) describe sophisticated implementations of error correction.
These parsers probably lend themselves well to customizable error messages as
described in this paper.



5 Conclusions 

We hope to have shown that parser combinators can be an acceptable alternative
to parser generators in practice. Moreover, the efficient implementation of the
combinators is surprisingly concise - laziness is essential for this implementation
technique.

At the same time, we have identified weaknesses of the parser combinators
approach, most notably the left-recursion limitation and the inability to analyse
the grammar at run-time.